Abstract. We prove in this paper that the weighted volume (or generating
function) of the set of integral transportation matrices between two integral
histograms r and c of equal sum is a positive definite kernel of r and c when the
set of considered weights forms a positive definite matrix. The computation of
this quantity, despite being the subject of a significant research effort in
algebraic statistics, remains an intractable challenge for histograms of even
modest dimensions. We propose an alternative kernel which, rather than considering
all matrices of the transportation polytope, only focuses on a sub-sample of
its vertices known as its Northwestern corner solutions. The resulting kernel
is positive definite and can be computed with a number of operations $O(R^2 d)$
that grows linearly in the dimension d, where $R^2$, the total number of
sampled vertices, is a parameter that controls the complexity of the kernel.



1. Introduction 

Suppose that among 30 students in a classroom, 7 and 23 have light and dark
colored eyes respectively. You are also told that 12 of them have light hair while
18 have dark hair. What are all the possible populations of the 4 subgroups of
students with light/light, dark/dark, light/dark and dark/light eyes and hair color
respectively? Such quantities can be arranged in a 2 x 2 matrix whose row sum
vector must be equal to $[7, 23]^T$ and column sum vector must be equal to $[12, 18]$,
$\begin{bmatrix} 3 & 4 \\ 9 & 14 \end{bmatrix}$ for instance, and more generally any
integer values in the dots below that satisfy these constraints:

          12   18
     7     .    .
    23     .    .
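The admissible tables of this small example can be listed directly. The Python sketch below is only an illustration (the function name `tables_2x2` is ours, not the paper's); it relies on the fact that in the 2 x 2 case the top-left entry determines the three other entries.

```python
def tables_2x2(r, c):
    """List all nonnegative integer 2x2 matrices with row sums r and column sums c."""
    if sum(r) != sum(c):
        return []
    tables = []
    # Choosing the top-left entry x11 fixes the three other entries.
    for x11 in range(min(r[0], c[0]) + 1):
        x12 = r[0] - x11
        x21 = c[0] - x11
        x22 = r[1] - x21
        if min(x12, x21, x22) >= 0:
            tables.append(((x11, x12), (x21, x22)))
    return tables


tables = tables_2x2((7, 23), (12, 18))
for table in tables:
    print(table)
```

Every table printed this way has row sums [7, 23] and column sums [12, 18].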

Alternatively, suppose that two bakeries in a small village produce daily 7 and 23
loaves of bread each, while two restaurants in the same area each need 12 and 18
loaves to serve their customers every day. What are all the possible morning delivery
plans of bread loaves that the two bakeries and shops can agree upon? These seemingly
trivial sets of matrices coincide, and are known in the statistics and optimization
literature as the sets of contingency tables and transportation plans respectively.

In statistics, the problem of enumerating all such tables arises naturally in
hypothesis testing. Suppose that by entering the aforementioned classroom you
observe that the actual distribution of these groups is [f 1 2 6 ]. Such an observation
intuitively suggests that eye and hair color are related, but how confident should
you be about this statement? In the 2 x 2 case presented above, the Fisher exact
test (Yates, 1934) answers that question by computing the probabilities of all
possible table outcomes if one assumes that they have been generated as the product
of independent Bernoulli variables with law $p_1 = 7/30$ and $p_2 = 12/30$. By
comparing all these probabilities with that of the observed table, we can conclude how
reliable an independence hypothesis would be. In optimization, given a 2 x 2 cost
matrix which describes the cost (in gas, calories or time) of bringing a loaf from
each bakery to each shop, finding the delivery plan with minimal cost is known
as a transportation problem. Transportation problems are an extremely general
class of linear programs which are known to encompass all instances of network
flows (Bertsimas and Tsitsiklis, 1997, p. 274).
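The table probabilities used by the Fisher exact test can be sketched as follows. This is a minimal illustration under an assumption not spelled out in the text: once both margins are fixed, the independence model gives each 2 x 2 table a hypergeometric probability. The function name `table_probability` is ours.

```python
from math import comb


def table_probability(x11, r, c):
    """Hypergeometric probability of the 2x2 table whose top-left entry is x11,
    given row sums r and column sums c (assumed conditioning on both margins)."""
    n = sum(r)
    return comb(c[0], x11) * comb(c[1], r[0] - x11) / comb(n, r[0])


r, c = (7, 23), (12, 18)
probs = {x11: table_probability(x11, r, c) for x11 in range(min(r[0], c[0]) + 1)}
for x11, p in probs.items():
    print(x11, round(p, 4))
```

Comparing the probability of an observed table with the others in `probs` is the ingredient the test builds on; the sketch does not compute a p-value.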

Optimal transportation distances (Rachev and Rüschendorf, 1998; Villani, 2009)
are distances between probability densities which combine both perspectives
outlined above, where the probabilistic view on contingency tables is matched with
the goal of computing an optimal transportation plan between two marginal
probabilities given a metric on the probability space of interest. Such distances have
been widely used in computer vision following the work of Rubner et al. (1997),
who used them to compare histograms of image features. When used in information
retrieval tasks, transportation distances usually fare better in practice than other
classical distances for histograms (Pele and Werman, 2009).

Transportation distances have however two notable drawbacks. First, from a
geometric point of view, transportation distances are deficient in the sense that
they are neither negative definite nor Hilbertian. Negative definiteness carries many
favorable properties, among which the possibility to create Euclidean embeddings
from which the metric can be accurately recovered, as well as the possibility to
turn the distance into a positive definite kernel by simple exponentiation, as a
radial basis function. Because of this deficiency, there is no known positive definite
counterpart to transportation distances that can leverage the complexity of the set
of contingency tables. Second, from a computational point of view, the cost of
computing transportation distances grows in most cases of interest at
least quadratically in the dimension d of the histograms, which can be prohibitive
for many applications.

We try to address both issues in this work. The main contribution of this paper is
theoretical: after providing some background material and motivation in Section 2,
we prove in Section 3 that the generating function of the set of all contingency
tables between two integral histograms is a positive definite kernel. Our second
contribution is practical: we propose in Section 4 a positive definite kernel that
leverages these ideas while still being computationally tractable.



2. Background 

2.1. The Transportation Polytope and the Set of Contingency Tables. We
review in this section a few definitions, notations and results of interest to prove
our result. In the following, we write $\langle \cdot, \cdot \rangle$ for both the
Frobenius dot-product and the usual dot-product of vectors.

Given a dimension d fixed throughout this paper, for two vectors $r, c \in \mathbb{R}^d$, let
$U(r, c)$ be the transportation polytope of r and c, namely the subset of nonnegative
matrices in $\mathbb{R}^{d \times d}$ defined as:

$$U(r, c) \overset{\text{def}}{=} \{X \in \mathbb{R}_+^{d \times d} \mid X 1_d = r,\ X^T 1_d = c\},$$

where $1_d$ is the d dimensional vector of ones. $U(r, c)$ contains all nonnegative d x d
matrices with row and column sums r and c respectively. It is easy to check that
$U(r, c)$ is non-empty if and only if all coordinates of r and c are non-negative and
if the total masses of r and c are the same, that is $r^T 1_d = c^T 1_d$. We will consider
in most of this work integral vectors r and c taken in the set $\Sigma_N^d$ of d-dimensional
integral histograms with total mass $N \in \mathbb{N}$,

$$\Sigma_N^d \overset{\text{def}}{=} \{r \in \mathbb{N}^d \mid r^T 1_d = N\}.$$

We will also focus accordingly on the subset $\mathbb{U}(r, c)$ of $U(r, c)$ that contains all
integral transportation matrices, alternatively known as contingency tables
(Lauritzen, 1982; Everitt, 1992),

$$\mathbb{U}(r, c) \overset{\text{def}}{=} U(r, c) \cap \mathbb{N}^{d \times d}.$$

2.2. Weighted Volumes of Contingency Tables and Particular Cases of
Positivity. Ranging from early work by Yates (1934); Good (1976) to Diaconis and Efron
(1985); Cryan and Dyer (2003); Chen et al. (2005), the computation of elementary
statistics about $\mathbb{U}(r, c)$ has attracted considerable attention. Many of the ideas of
this paper build upon recent work by Barvinok, most notably on his study of the
generating function of $\mathbb{U}(r, c)$, defined for $M \in \mathbb{R}^{d \times d}$ as

$$V(r, c; M) \overset{\text{def}}{=} \sum_{X \in \mathbb{U}(r, c)} e^{-\langle X, M \rangle}.$$



The generating function can be related to the weighted volume (Barvinok, 2008,
p. 2) of $\mathbb{U}(r, c)$, defined for any nonnegative d x d matrix $K \in \mathbb{R}_+^{d \times d}$ as:

$$T(r, c; K) \overset{\text{def}}{=} \sum_{X \in \mathbb{U}(r, c)} \prod_{ij} k_{ij}^{x_{ij}}.$$

Both definitions are equivalent since if we agree that $k_{ij} = e^{-m_{ij}}$ then
$T(r, c; K) = V(r, c; M)$. Because all of our results rely on K's properties, we will mostly
use the weighted volume formulation in this paper. Some sections in this paper, notably
§2.3 below and §4, are better understood with the generating function formulation.
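The equivalence between the two formulations can be checked by brute force on small histograms. The sketch below is an illustration only: the enumeration is exponential in the size of the problem, and the function names (`contingency_tables`, `weighted_volume`, `generating_function`) are ours.

```python
from math import exp, prod


def contingency_tables(r, c):
    """Yield every nonnegative integer matrix with row sums r and column sums c."""
    def rows(total, caps):
        # All ways to split `total` into len(caps) parts, part j bounded by caps[j].
        if not caps:
            if total == 0:
                yield ()
            return
        for x in range(min(total, caps[0]) + 1):
            for rest in rows(total - x, caps[1:]):
                yield (x,) + rest

    def fill(i, remaining):
        # `remaining` holds the column mass still to be assigned below row i.
        if i == len(r):
            if all(v == 0 for v in remaining):
                yield ()
            return
        for row in rows(r[i], remaining):
            left = tuple(v - x for v, x in zip(remaining, row))
            for rest in fill(i + 1, left):
                yield (row,) + rest

    yield from fill(0, tuple(c))


def weighted_volume(r, c, K):
    """T(r, c; K): sum over tables X of prod_ij K[i][j] ** X[i][j]."""
    return sum(prod(K[i][j] ** x for i, row in enumerate(X) for j, x in enumerate(row))
               for X in contingency_tables(r, c))


def generating_function(r, c, M):
    """V(r, c; M): sum over tables X of exp(-<X, M>)."""
    return sum(exp(-sum(M[i][j] * x for i, row in enumerate(X) for j, x in enumerate(row)))
               for X in contingency_tables(r, c))


M = [[0.0, 1.0], [1.0, 0.0]]
K = [[exp(-m) for m in row] for row in M]   # k_ij = exp(-m_ij)
print(weighted_volume((2, 1), (1, 2), K), generating_function((2, 1), (1, 2), M))
```

With K set to the all-ones matrix, `weighted_volume` simply counts the tables.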

Cuturi (2007, Prop. 2) proved that the cardinal of the set $\mathbb{U}(r, c)$ is a positive
definite kernel of r and c using the Robinson-Schensted-Knuth bijection (Knuth,
1970) that maps each contingency table to a pair of Young tableaux with contents
r and c and the same pattern. It is easy to see that the cardinal of $\mathbb{U}(r, c)$ is equal
to $T(r, c; 1_{d \times d})$ or $V(r, c; 0_{d \times d})$. Cuturi (2007, Prop. 1) also proved that $T(r, c; K)$
is a positive definite kernel of r and c if both are binary histograms and if K is a
nonnegative d x d positive definite matrix. Since the computation of T entails in
that case the computation of the permanent of a Gram matrix, Cuturi (2007) called
this kernel the permanent kernel. The main contribution of our paper is to prove
in Theorem 1 that the map $(r, c) \in \Sigma_N^d \times \Sigma_N^d \mapsto T(r, c; K)$ is positive
definite whenever K is a d x d positive definite matrix.

2.3. Relationships with the Optimal Transportation Distance. Given a d x d
cost matrix M, one can quantify the cost of mapping r to c using a transportation
matrix X as $\langle X, M \rangle$. The minimum of this cost is called the optimal transportation
cost, defined as:

$$d_M(r, c) \overset{\text{def}}{=} \min_{X \in U(r, c)} \langle X, M \rangle.$$

A classical result of optimization in network flows (Bertsimas and Tsitsiklis, 1997,
Theo. 7.5) guarantees the existence of a contingency table $X^\star \in \mathbb{U}(r, c)$ which
achieves this minimum, as schematically represented in Figure 1. Such an optimal
table $X^\star$ can be obtained algorithmically in polynomial time (Ahuja et al., 1993,
§9).

Figure 1. Schematic representation of the set $\mathbb{U}(r, c)$ of contingency
tables seen as the intersection between the lattice of integral
matrices $\mathbb{N}^{d \times d}$ with the transportation polytope $U(r, c)$. Each red
dot stands for an integral plan $X \in \mathbb{U}(r, c)$. The inner color in each
red dot stands for the value of $\langle X, M \rangle$, which can be seen to go
gradually from $\langle X^\star, M \rangle$ to $\langle X^\circ, M \rangle$, that is from the minimum to
the maximum of $\langle \cdot, M \rangle$ over $U(r, c)$, or equivalently $\mathbb{U}(r, c)$. The
generating function $V(r, c; M)$ of $\mathbb{U}(r, c)$ considers the contributions
of all contingency tables.

The minimal cost $d_M(r, c)$ turns out to be a distance (Villani, 2009, §6.1)
whenever the matrix M is itself a metric. This distance is also known as the Wasserstein
distance, Monge-Kantorovich's, Mallows' or Earth Mover's (Rubner et al., 1997) in
the computer vision literature. The transportation distance is not negative definite
in the general case, as shown by counterexamples (Naor and Schechtman, 2007)
and embedding distortion results (Andoni et al., 2009). Although some metrics
M can yield a negative definite distance¹, characterizing the negative definiteness
of $d_M$ remains an open question. Despite this fact, transportation distances have
been used in practice to derive a pseudo-positive definite kernel: both Jing et al.
(2004, §4.C) and Zhang et al. (2006, §2.3) introduce the exponential of (minus) the
minimum of $\langle X, M \rangle$,

$$(1) \qquad k_M(r, c) = e^{-d_M(r, c)} = \exp\Big(-\min_{X \in U(r, c)} \langle X, M \rangle\Big),$$

to form an indefinite kernel which can be used to compare histograms in practice.
We prove that, although the value $\exp(-\langle X^\star, M \rangle)$ in itself is not a positive definite
kernel, the sum of each term $\exp(-\langle X, M \rangle)$ over all possible contingency tables in
$\mathbb{U}(r, c)$ is positive definite when M has suitable properties. The generating function
$V(r, c; M)$ can be interpreted as the exponential of (minus) the soft-minimum of
$\langle X, M \rangle$ over all contingency tables,

$$V(r, c; M) = \exp\Big(-\operatorname{softmin}_{X \in \mathbb{U}(r, c)} \langle X, M \rangle\Big) = e^{\log \sum_{X \in \mathbb{U}(r, c)} e^{-\langle X, M \rangle}},$$

where the soft-minimum of a finite family of scalars $(u_i)$ is

$$\operatorname{softmin}_i u_i = -\log \sum_i e^{-u_i}.$$

This expression relates our results in this work to previous applications of soft
minimums to derive positive definite kernels from combinatorial distances for strings
(Vert et al., 2004), time series (Cuturi et al., 2007) and trees (Shin et al., 2011).
These ideas are summarized in Figure 1.

¹ Setting $M = 1_{d \times d} - I_d$ yields the total variation distance between discrete
probabilities, which is half the Manhattan or $l_1$ distance between r and c. All these
distances are known to be negative definite.
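The soft-minimum used above can be computed stably by factoring out the smallest value. The sketch below is an illustration, with a hypothetical list `costs` standing in for the values of the cost over a set of tables; it also shows that the soft-minimum never exceeds the minimum, and differs from it by at most the logarithm of the number of terms.

```python
from math import exp, log


def softmin(values):
    """Soft-minimum -log(sum_i exp(-u_i)), shifted by min(values) for stability."""
    lo = min(values)
    return lo - log(sum(exp(-(u - lo)) for u in values))


costs = [1.0, 3.0]          # hypothetical costs <X, M> of two tables
s = softmin(costs)
print(s, min(costs))        # the soft-minimum is below the minimum
print(exp(-s))              # equals exp(-1) + exp(-3), the generating function
```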



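2.4. Generalized Permutations. We close this section by providing some tools
to prove the result. We write $S_N$ for the group of permutations over the set
$\{1, \cdots, N\}$. For any vector a of size N and permutation $\pi \in S_N$, we write $a_\pi$
for the permuted vector with coordinates $a_\pi = [a_{\pi(1)}\ a_{\pi(2)} \cdots a_{\pi(N)}]$ and $a_{p..q}$ for
the subvector $[a_p \cdots a_q]$ when $1 \le p \le q \le N$. For two vectors $\rho, \gamma$ of $\{1, \cdots, d\}^N$,
the 2 x N array

$$(\rho; \gamma) \overset{\text{def}}{=} \begin{pmatrix} \rho_1 & \rho_2 & \cdots & \rho_N \\ \gamma_1 & \gamma_2 & \cdots & \gamma_N \end{pmatrix}$$

is called a generalized permutation (Knuth, 1970). To any generalized permutation
$(\rho; \gamma)$ corresponds a d x d integral matrix $\chi(\rho; \gamma)$ defined as (Fulton, 1997, p. 41):

$$(2) \qquad [\chi(\rho; \gamma)]_{ij} = \sum_{n=1}^{N} 1_{\rho_n = i}\, 1_{\gamma_n = j}, \qquad 1 \le i, j \le d.$$

Consider the following example where d = 3, N = 8 and
$\rho = [1\ 2\ 2\ 2\ 1\ 3\ 1\ 3]$, $\gamma = [1\ 1\ 2\ 1\ 3\ 3\ 3\ 3]$,

$$(\rho; \gamma) = \begin{pmatrix} 1 & 2 & 2 & 2 & 1 & 3 & 1 & 3 \\ 1 & 1 & 2 & 1 & 3 & 3 & 3 & 3 \end{pmatrix}, \qquad \chi(\rho; \gamma) = \begin{bmatrix} 1 & 0 & 2 \\ 2 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix}.$$

If we consider now the permutation $\pi = [3\ 6\ 8\ 5\ 2\ 1\ 4\ 7]$ we have that
$\rho = [1\ 2\ 2\ 2\ 1\ 3\ 1\ 3]$, $\gamma_\pi = [2\ 3\ 3\ 3\ 1\ 1\ 1\ 3]$,

$$(\rho; \gamma_\pi) = \begin{pmatrix} 1 & 2 & 2 & 2 & 1 & 3 & 1 & 3 \\ 2 & 3 & 3 & 3 & 1 & 1 & 1 & 3 \end{pmatrix}, \qquad \chi(\rho; \gamma_\pi) = \begin{bmatrix} 2 & 1 & 0 \\ 0 & 0 & 3 \\ 1 & 0 & 1 \end{bmatrix}.$$

Note that if $\rho$ and $\gamma$ have respectively $r_i$ and $c_i$ elements i among their N coefficients
for all $1 \le i \le d$, then $\chi(\rho; \gamma) \in \mathbb{U}(r, c)$. One can see above that the corresponding
histograms are r = [3, 3, 2] and c = [3, 1, 4] and that both $\chi(\rho; \gamma)$ and $\chi(\rho; \gamma_\pi)$ have
row and column sums r and c.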
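The matrix associated with a generalized permutation in Equation (2) is a simple co-occurrence count. The following Python sketch reproduces the example of this section; the helper names `chi` and `permute` are ours, and indices are 1-based as in the text.

```python
def chi(rho, gamma, d):
    """d x d matrix counting, for each (i, j), the positions n with rho_n = i and gamma_n = j."""
    X = [[0] * d for _ in range(d)]
    for i, j in zip(rho, gamma):
        X[i - 1][j - 1] += 1
    return X


def permute(a, pi):
    """Permuted vector [a_pi(1), ..., a_pi(N)] for a 1-based permutation pi."""
    return [a[p - 1] for p in pi]


rho = [1, 2, 2, 2, 1, 3, 1, 3]
gamma = [1, 1, 2, 1, 3, 3, 3, 3]
gamma_pi = permute(gamma, [3, 6, 8, 5, 2, 1, 4, 7])

print(chi(rho, gamma, 3))
print(chi(rho, gamma_pi, 3))
```

Permuting only the second row of the array changes the table but leaves its row and column sums unchanged.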
3. The Weighted Volume as a Positive Definite Kernel

Theorem 1. Let $K \in \mathbb{R}_+^{d \times d}$. The map $(r, c) \mapsto T(r, c; K)$ is positive definite if K
is positive definite.

The proof relies on the following observation: Barvinok (2008) showed that the
weighted volume of $\mathbb{U}(r, c)$ of two integral histograms r and c of total mass N can
be formulated as the expectation of the permanent of a random N x N matrix. To
do so, Barvinok shows that the weighted volume, a sum indexed over all contingency
tables $X \in \mathbb{U}(r, c)$, can be rewritten as a sum indexed over all permutations $\pi$ in
$S_N$, up to a correcting term known as the Fisher-Yates statistic (Equation (5) in
the Appendix). The crux of Barvinok's proof lies in a randomization scheme,
using draws from the exponential law, to cancel out the Fisher-Yates statistic. We
adopt a similar route to prove the positivity of T, by proving that the inverse of
the Fisher-Yates statistic (defined as $k_2$ below) is itself positive definite to obtain
the result.

Proof. Suppose that $K \in \mathbb{R}_+^{d \times d}$ is positive definite and consider two integral
histograms r, c in $\Sigma_N^d$. We represent r as an N-dimensional vector $\rho \in \{1, \cdots, d\}^N$,

$$\rho = [\underbrace{1, \cdots, 1}_{r_1 \text{ times}}, \underbrace{2, \cdots, 2}_{r_2 \text{ times}}, \cdots, \underbrace{d, \cdots, d}_{r_d \text{ times}}],$$

and consider the analogous representation $\gamma$ for c. Let $k_1$ and $k_2$ be the following
kernels on $(\rho, \gamma)$:

$$k_1(\rho, \gamma) = \prod_{t=1}^{N} k(\rho_t, \gamma_t), \quad \text{where } k(i, j) = k_{ij} \text{ for } 1 \le i, j \le d,$$

$$k_2(\rho, \gamma) = \frac{1}{r_1! \cdots r_d!}\, \frac{1}{c_1! \cdots c_d!}\, \prod_{ij} x_{ij}!, \quad \text{where } X = \chi(\rho; \gamma) \text{ (see §2.4, Eq. (2))}.$$

The kernel $k_2$ is the inverse of the Fisher-Yates statistic (Equation (5) in the
Appendix) associated to an integral transportation table X and its marginals r and c.
$k_1$ is trivially positive definite. The first group of terms of $k_2$ is trivially positive
definite as a product $f(r) f(c)$ where $f(r) = 1 / (r_1! \cdots r_d!)$. We prove that the other term,
the product of factorials of $x_{ij}$, is positive definite in Lemma 3 using the proof
strategy of a related result provided in Lemma 2. Lemma 4 proves that when a
kernel k on two vectors is symmetric (the definition is provided in the lemma),
the sum $\sum_{\pi \in S_N} k(\rho, \gamma_\pi)$ is itself positive definite. We use this result on the
product $\kappa(\rho, \gamma) = k_1(\rho, \gamma)\, k_2(\rho, \gamma)$ which is trivially symmetric as the product of two
symmetric kernels. We then prove in Lemma 5 that

$$\sum_{\pi \in S_N} \kappa(\rho, \gamma_\pi) = T(r, c; K).$$

Since the summation over all permutations in the left hand side is positive definite
by Lemma 4, we conclude that $T(r, c; K)$ is itself a positive definite kernel as the
product of two positive definite kernels. ■
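The statement of Theorem 1 can be sanity-checked numerically on a tiny case. The sketch below is not a proof and is an illustration only: it builds the Gram matrix of T over all histograms of $\Sigma_2^2$ for a hypothetical positive definite matrix K, and tests positive definiteness with a plain Cholesky factorization. All function names are ours.

```python
from itertools import product
from math import prod


def tables(r, c):
    """Yield all nonnegative integer matrices with row sums r and column sums c."""
    if not r:
        if not any(c):
            yield ()
        return
    for row in product(*(range(min(r[0], cj) + 1) for cj in c)):
        if sum(row) == r[0]:
            rest_c = tuple(cj - x for cj, x in zip(c, row))
            for rest in tables(r[1:], rest_c):
                yield (row,) + rest


def weighted_volume(r, c, K):
    return sum(prod(K[i][j] ** x for i, row in enumerate(X) for j, x in enumerate(row))
               for X in tables(r, c))


def histograms(d, N):
    """All d-dimensional integral histograms of total mass N."""
    return [h for h in product(range(N + 1), repeat=d) if sum(h) == N]


def is_positive_definite(G, tol=1e-12):
    """Cholesky factorization attempt; succeeds only if G is numerically positive definite."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][i] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True


K = [[1.0, 0.5], [0.5, 1.0]]     # hypothetical nonnegative positive definite weights
H = histograms(2, 2)             # [(0, 2), (1, 1), (2, 0)]
G = [[weighted_volume(r, c, K) for c in H] for r in H]
print(G, is_positive_definite(G))
```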
4. Northwestern Kernel

The weighted volume $T(r, c; K)$ cannot be computed exactly even for small
dimensions d, and approximations (Barvinok, 2008) are currently both too expensive
and too loose to be of practical interest in a machine learning context. We adopt
in this section an alternative approach, in which we propose to restrict the sum
of elementary contributions $\exp(-\langle X, M \rangle)$ to a subset of extreme points of $U(r, c)$
and obtain a kernel whose computational complexity grows linearly in both the
dimension d and the size of the sample of extreme points. The main tool for this
approach is provided by the Northwestern corner rule to generate a vertex of $U(r, c)$,
which we recall in Section 4.1. We define the Northwestern kernel in Section 4.2 and
prove that it is positive definite. For any matrix $M \in \mathbb{R}^{d \times d}$, we write $M_{\sigma\sigma'}$ for the
row and column permuted matrix whose i, j element is $m_{\sigma(i)\sigma'(j)}$.

4.1. The Northwestern Corner Rule to Generate Vertices of U(r,c). The
Northwestern corner rule is a heuristic that produces a vertex of the polytope
$U(r, c)$ in up to 2d operations. The rule starts by giving the highest possible value
to $x_{11}$, and at each step when a highest possible value is given to entry $x_{ij}$ it
moves on to $x_{i,j+1}$ in case $x_{ij}$ filled column j, or $x_{i+1,j}$ in case $x_{ij}$ filled row i. The
rule proceeds until $x_{dd}$ has received a value. Here is an example of this sequence
assuming r = [2, 5, 3] and c = [5, 1, 4]:

$$x_{11} = 2 \ \to\ x_{21} = 3 \ \to\ x_{22} = 1 \ \to\ x_{23} = 1 \ \to\ x_{33} = 3, \qquad \text{which yields} \qquad \begin{bmatrix} 2 & 0 & 0 \\ 3 & 1 & 1 \\ 0 & 0 & 3 \end{bmatrix}.$$
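The rule described above can be sketched in a few lines of Python. This is an illustrative implementation under our own naming (`northwest_corner`); when an entry fills a row and a column at the same time, the sketch moves down first, which only adds a zero entry and does not change the resulting table.

```python
def northwest_corner(r, c):
    """Northwestern corner table for two histograms r and c of equal total mass."""
    assert sum(r) == sum(c), "r and c must have the same total mass"
    r, c = list(r), list(c)              # remaining row and column masses
    X = [[0] * len(c) for _ in r]
    i = j = 0
    while i < len(r) and j < len(c):
        x = min(r[i], c[j])              # highest value entry (i, j) can take
        X[i][j] = x
        r[i] -= x
        c[j] -= x
        if r[i] == 0:
            i += 1                       # row i is filled: move down
        else:
            j += 1                       # column j is filled: move right
    return X


print(northwest_corner([2, 5, 3], [5, 1, 4]))
```

The loop visits at most 2d - 1 entries, in line with the operation count stated in the text.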
We write NW(r, c) for the unique Northwestern corner solution that can be
obtained through this heuristic. There is, however, a much larger number of
Northwestern corner solutions that can be obtained by permuting arbitrarily the order
of r and c separately, computing the corresponding Northwestern corner table, and
recovering a table of $U(r, c)$ by inverting again the order of columns and rows.
Setting $\sigma = (3, 1, 2)$, $\sigma' = (3, 2, 1)$ we have that $r_\sigma = [3, 2, 5]$, $c_{\sigma'} = [4, 1, 5]$ and
$\sigma^{-1} = (2, 3, 1)$, $\sigma'^{-1} = (3, 2, 1)$. Observe that:

$$\mathrm{NW}(r_\sigma, c_{\sigma'}) = \begin{bmatrix} 3 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 5 \end{bmatrix} \in U(r_\sigma, c_{\sigma'}), \qquad \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) = \begin{bmatrix} 0 & 1 & 1 \\ 5 & 0 & 0 \\ 0 & 0 & 3 \end{bmatrix} \in U(r, c).$$

Let $\mathcal{N}(r, c)$ be the set of all Northwestern corner solutions that can be produced
this way:

$$\mathcal{N}(r, c) = \{\mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}),\ \sigma, \sigma' \in S_d\}.$$
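The construction of the set of permuted Northwestern corner solutions can be sketched by brute force for small d. The code below is an illustration only, with our own function names and 1-based permutations given as tuples; it enumerates all $d!^2$ pairs, which is only feasible for very small d.

```python
from itertools import permutations


def northwest_corner(r, c):
    """Northwestern corner table for two histograms of equal total mass."""
    r, c = list(r), list(c)
    X = [[0] * len(c) for _ in r]
    i = j = 0
    while i < len(r) and j < len(c):
        x = min(r[i], c[j])
        X[i][j] = x
        r[i] -= x
        c[j] -= x
        if r[i] == 0:
            i += 1
        else:
            j += 1
    return X


def permuted_northwest_corner(r, c, sigma, sigma_p):
    """Table of U(r, c) obtained by reordering r and c, applying the rule, and undoing the reordering."""
    d = len(r)
    Y = northwest_corner([r[s - 1] for s in sigma], [c[s - 1] for s in sigma_p])
    X = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # Row i of Y carries the mass of original row sigma(i); same for columns.
            X[sigma[i] - 1][sigma_p[j] - 1] = Y[i][j]
    return X


def all_northwest_corner_solutions(r, c):
    """Distinct tables obtained over all pairs of row/column orderings."""
    perms = list(permutations(range(1, len(r) + 1)))
    return {tuple(map(tuple, permuted_northwest_corner(r, c, s, sp)))
            for s in perms for sp in perms}


r, c = [2, 5, 3], [5, 1, 4]
print(permuted_northwest_corner(r, c, (3, 1, 2), (3, 2, 1)))
print(len(all_northwest_corner_solutions(r, c)))
```

The number of distinct tables is typically smaller than the number of permutation pairs, since several pairs can lead to the same table.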

Note that all Northwestern corner solutions only have by construction up to 2d - 1
nonzero elements. The Northwestern corner rule produces a table which is by
construction unique for r and c, but there is an exponential number of pairs of
row/column permutations $(\sigma, \sigma')$ that may share the same table (Stougie, 2002,
p. 2). $\mathcal{N}(r, c)$ is a subset of the set of extreme points of $U(r, c)$ (Brualdi, 2006,
Corollary 8.1.4). NW(r, c) is an optimal transportation between r and c if the cost
matrix M is a Monge matrix (Hoffman, 1961), that is a matrix M that satisfies the
inequalities

$$\forall\, 1 \le i, j, k, l \le d, \qquad m_{ij} + m_{kl} \le m_{il} + m_{kj}.$$

Note however that a distance matrix cannot be a Monge matrix since the inequality
above applied to k = j and l = i would imply that $0 < 2 m_{ij} \le m_{ii} + m_{jj} = 0$.

4.2. Random Sampling of Northwestern Corner Solutions. We propose in
this section a kernel which uses arbitrary row/column permutations of r and c to
recover extreme points of $U(r, c)$ and sum their individual contribution:

Theorem 2. Let R be an arbitrary subset of permutations in $S_d$. The Northwestern
kernel sampled on R and parameterized by a matrix M, defined as

$$N(r, c; K, R) \overset{\text{def}}{=} \sum_{\sigma, \sigma' \in R} \exp\big(-\langle M, \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) \rangle\big),$$

is a positive definite kernel if K, the element-wise exponential of -M, is positive
definite.

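Proof. As in the proof of Theorem 1, consider the representation of an integral
histogram $r \in \Sigma_N^d$ as an N-dimensional vector $\rho$ that replicates $r_i$ times the index i
for all i from 1 to d. We also define, for any permutation $\sigma$ of $S_d$, the vector $\rho_\sigma$ as

$$\rho_\sigma \overset{\text{def}}{=} [\underbrace{\sigma(1), \cdots, \sigma(1)}_{r_{\sigma(1)} \text{ times}}, \underbrace{\sigma(2), \cdots, \sigma(2)}_{r_{\sigma(2)} \text{ times}}, \cdots, \underbrace{\sigma(d), \cdots, \sigma(d)}_{r_{\sigma(d)} \text{ times}}].$$

$\rho_\sigma$ for $\sigma \in S_d$ should not be confused with $\rho_\pi$ for $\pi \in S_N$ (§2.4): for any
permutation $\sigma \in S_d$ there exists at least one permutation $\pi \in S_N$ such that $\rho_\sigma = \rho_\pi$
but the converse is not usually true. We show in Lemma 1 that for $\sigma, \sigma' \in S_d$,
$\mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) = \chi(\rho_\sigma; \gamma_{\sigma'})$, and thus,

$$N(r, c; K, R) = \sum_{\sigma, \sigma' \in R} e^{-\langle M, \chi(\rho_\sigma; \gamma_{\sigma'}) \rangle} = \sum_{\sigma, \sigma' \in R} k_1(\rho_\sigma, \gamma_{\sigma'}),$$

where $k_1$ is defined in Theorem 1. $N(r, c; K, R)$ is positive definite as a convolution
kernel. ■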

Lemma 1. Let $\sigma$ and $\sigma'$ be two permutations of $S_d$. Then

$$\mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) = \chi(\rho_\sigma; \gamma_{\sigma'}).$$

Proof. We write $E_{ij}$ for the d x d matrix of zeros except for the (i, j) element
set to 1. We prove the result by induction on the total mass N. For N = 1
the result is trivial since the only transportation matrix in $U(r, c)$ in that case is
$E_{\sigma(i_1)\sigma'(i_2)}$ where $i_1$ and $i_2$ are such that $r_{\sigma(i_1)} = c_{\sigma'(i_2)} = 1$. Suppose now that the
result is true for all histograms of mass N and consider the case where $r^T 1_d =
c^T 1_d = N + 1$. Let $i_1$ and $i_2$ be the smallest indices such that $r_{\sigma(i_1)} > 0$ and
$c_{\sigma'(i_2)} > 0$ respectively. As a consequence, the first elements of $\rho_\sigma$ and $\gamma_{\sigma'}$ are $\sigma(i_1)$
and $\sigma'(i_2)$ respectively. Consider the two vectors $\rho^\star$ and $\gamma^\star$ of length N equal to
$\rho_\sigma$ and $\gamma_{\sigma'}$ without these two first elements. Setting $\tilde{r}$ and $\tilde{c}$ to r and c except for
the fact that $\tilde{r}_{\sigma(i_1)} = r_{\sigma(i_1)} - 1$ and $\tilde{c}_{\sigma'(i_2)} = c_{\sigma'(i_2)} - 1$, we have by induction that
$\mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(\tilde{r}_\sigma, \tilde{c}_{\sigma'}) = \chi(\rho^\star; \gamma^\star)$, since the two histograms have total mass N and
their representations are respectively $\rho^\star$ and $\gamma^\star$. By definition of the Northwestern
corner rule, adding a unit of mass to the $i_1$-th and $i_2$-th components of $\tilde{r}_\sigma$ and $\tilde{c}_{\sigma'}$ only
changes the very first iteration of the rule, since all coordinates of $\tilde{r}_\sigma$ and $\tilde{c}_{\sigma'}$ up
to but not including $i_1$ and $i_2$ respectively are null by construction. Applying the
rule yields a transportation table with an added unit in location $(i_1, i_2)$, providing
thus the identity

$$\mathrm{NW}(r_\sigma, c_{\sigma'}) = \mathrm{NW}(\tilde{r}_\sigma, \tilde{c}_{\sigma'}) + E_{i_1 i_2},$$

which implies that

$$(3) \qquad \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) = \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(\tilde{r}_\sigma, \tilde{c}_{\sigma'}) + E_{\sigma(i_1)\sigma'(i_2)}.$$

By definition of $\chi$ we have that

$$(4) \qquad \chi(\rho_\sigma; \gamma_{\sigma'}) = \chi(\rho^\star; \gamma^\star) + E_{\sigma(i_1)\sigma'(i_2)};$$

we get by combining Equations (3) and (4) above with the induction hypothesis
that $\mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) = \chi(\rho_\sigma; \gamma_{\sigma'})$. ■

Remark 1. The evaluation of $N(r, c; K, R)$ requires $O(d|R|^2)$ steps since computing
each of the $|R|^2$ contributions $\exp(-\langle M, \mathrm{NW}_{\sigma^{-1}\sigma'^{-1}}(r_\sigma, c_{\sigma'}) \rangle)$ for a couple $\sigma, \sigma'$
requires up to 2d products. The size of $R \subset S_d$ can be controlled from a few
permutations to an exhaustive enumeration, which would entail an overall complexity of
the order of $O(d \cdot d!^2)$.
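The evaluation described in Remark 1 can be sketched as follows. This is an illustrative implementation under our own naming: `northwest_cost` walks through the Northwestern corner rule for one pair of orderings and accumulates the cost without storing the table, and `northwestern_kernel` sums the resulting contributions over all pairs taken from a list R of 1-based permutations.

```python
from math import exp


def northwest_cost(r, c, M, sigma, sigma_p):
    """Cost <M, NW table> for the orderings sigma and sigma_p, in O(d) operations."""
    r_s = [r[s - 1] for s in sigma]
    c_s = [c[s - 1] for s in sigma_p]
    i = j = 0
    cost = 0.0
    while i < len(r_s) and j < len(c_s):
        x = min(r_s[i], c_s[j])
        # Entry (i, j) of the permuted table sits at (sigma(i), sigma'(j)) in the original order.
        cost += x * M[sigma[i] - 1][sigma_p[j] - 1]
        r_s[i] -= x
        c_s[j] -= x
        if r_s[i] == 0:
            i += 1
        else:
            j += 1
    return cost


def northwestern_kernel(r, c, M, R):
    """Sum of exp(-cost) over all pairs of orderings in R."""
    return sum(exp(-northwest_cost(r, c, M, s, sp)) for s in R for sp in R)


M = [[0.0, 1.0], [1.0, 0.0]]
R = [(1, 2), (2, 1)]
print(northwestern_kernel([1, 1], [1, 1], M, R))   # 2 + 2 * exp(-2)
```

Accumulating the cost on the fly avoids allocating a d x d table for each of the $|R|^2$ pairs.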

5. Conclusion and Future Work 

We have proved in this paper that the fundamental ingredient of transportation
distances, the polytope of contingency tables, can be used to define a positive
definite kernel between two histograms. While the cost matrix of a transportation
problem between two histograms r and c needs to be a distance matrix for the
optimum to be itself a distance of r and c, we have proved that the generating
function of the same polytope is positive definite whenever the element-wise
exponential of minus the cost matrix is itself positive definite. This quantity is
computationally intractable, and we have resorted to a summation that only
considers a subset of extreme points of the polytope to define the Northwestern
kernel. Future research includes the proposal of suitable subsets R of permutations
of $S_d$ tuned with data, as well as other approximation schemes.

Appendix: Intermediate Results for the Proof of Theorem 1

Lemma 2. Let $a, b \in \{0, 1\}^N$ be two binary vectors. The kernel $(a, b) \mapsto \langle a, b \rangle!$ is
positive definite.

Proof. For N = 1 the kernel is always equal to 1 and is thus trivially positive
definite. For N > 1, the recursion
$\langle a, b \rangle! = \langle a_{1..N-1}, b_{1..N-1} \rangle!\, (a_N b_N \langle a_{1..N-1}, b_{1..N-1} \rangle + 1)$
provides the expression

$$\langle a, b \rangle! = \prod_{t=1}^{N-1} \big(a_{t+1} b_{t+1} \langle a_{1..t}, b_{1..t} \rangle + 1\big),$$

which shows that $\langle a, b \rangle!$ is the product of N - 1 positive definite kernels on different
features of a and b. ■
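The product identity obtained in this proof can be verified exhaustively for short binary vectors. The sketch below is a numerical check of the identity only (not of positive definiteness), with helper names of our own.

```python
from itertools import product
from math import factorial


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def factorial_via_product(a, b):
    """prod_{t=1}^{N-1} (a_{t+1} b_{t+1} <a_{1..t}, b_{1..t}> + 1), with 0-based lists."""
    out = 1
    for t in range(1, len(a)):
        # a[t] is a_{t+1} in the 1-based notation of the text, a[:t] is a_{1..t}.
        out *= a[t] * b[t] * dot(a[:t], b[:t]) + 1
    return out


N = 4
mismatches = [(a, b)
              for a in product((0, 1), repeat=N)
              for b in product((0, 1), repeat=N)
              if factorial(dot(a, b)) != factorial_via_product(a, b)]
print(len(mismatches))
```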

Remark 2. Rather than the lemma itself, we will use the identity above in the
proof of Lemma 3. We conjecture that this result can be extended to integral vectors.
Numerical counterexamples show that this result cannot be generalized to vectors of
$\mathbb{R}^N$ through Euler's or Hadamard's $\Gamma$ function.
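Lemma 3. Let $\rho, \gamma \in \{1, \cdots, d\}^N$. The kernel $(\rho, \gamma) \mapsto \prod_{ij} x_{ij}!$, where
$X = \chi(\rho; \gamma)$, is positive definite.

Proof. An integral vector $\rho \in \{1, \cdots, d\}^N$ with N components can be represented
as a family of d binary row vectors $\rho^1, \cdots, \rho^d$ of length N where for $n \le N$,
$\rho^i_n = 1_{\rho_n = i}$. For instance,

$$\text{if } \rho = [1\ 1\ 2\ 2\ 2\ 1\ 3\ 1\ 3\ 3], \text{ then } \begin{bmatrix} \rho^1 \\ \rho^2 \\ \rho^3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix}.$$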

These d binary vector representations can be used to obtain the matrix $\chi(\rho; \gamma)$.
Indeed, it is easy to check that if $X = \chi(\rho; \gamma)$ then $x_{ij} = \langle \rho^i, \gamma^j \rangle$. As a consequence,
we have that for all indices i, j the coefficient $x_{ij}! = \langle \rho^i, \gamma^j \rangle!$. We obtain that the
product of factorials

$$\prod_{ij} x_{ij}! = \prod_{ij} \langle \rho^i, \gamma^j \rangle!$$

is thus a product of kernels evaluated on all possible pairs among the d x d
representations for $\rho$ and $\gamma$. Although one might be tempted to interpret this product
as a convolution kernel (Haussler, 1999) or a mapping kernel (Shin and Kuboyama,
2008), one should recall that such results only apply to sums of local kernels and
not to products. Such products of kernels on parts are not, as simple
counterexamples can show, positive definite in the general case. Using the decomposition which
was used in the proof of Lemma 2 we have however that:



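$$\prod_{ij} x_{ij}! = \prod_{ij} \langle \rho^i, \gamma^j \rangle! = \prod_{ij} \prod_{t=1}^{N-1} \big(\rho^i_{t+1} \gamma^j_{t+1} \langle \rho^i_{1..t}, \gamma^j_{1..t} \rangle + 1\big)$$

$$= \prod_{t=1}^{N-1} \prod_{ij} \big(\rho^i_{t+1} \gamma^j_{t+1} \langle \rho^i_{1..t}, \gamma^j_{1..t} \rangle + 1\big) = \prod_{t=1}^{N-1} \Big(1 + \sum_{ij} \rho^i_{t+1} \gamma^j_{t+1} \langle \rho^i_{1..t}, \gamma^j_{1..t} \rangle\Big),$$

where we have used in the last operation the fact that only one of all $d^2$ products
$(\rho^i_{t+1} \gamma^j_{t+1})_{ij}$ is nonzero, since

$$\rho^i_{t+1} \gamma^j_{t+1} = \begin{cases} 1, & \text{if } \rho_{t+1} = i \text{ and } \gamma_{t+1} = j, \\ 0, & \text{else.} \end{cases}$$

The product of factorials is thus a product of N - 1 positive definite kernels indexed
by t and defined on $\rho$ and $\gamma$, where each of these N - 1 kernels is 1 plus a convolution
kernel operating on the d decompositions of $\rho_{1..t+1}$ and $\gamma_{1..t+1}$ as d binary feature vectors,
that is

$$\prod_{ij} x_{ij}! = \prod_{t=1}^{N-1} \big(1 + h_t(\rho, \gamma)\big),$$

where

$$h_t(\rho, \gamma) = \sum_{i,j=1}^{d} h_t(\rho^i, \gamma^j) \quad \text{and} \quad h_t(a, b) = a_{t+1} b_{t+1} \langle a_{1..t}, b_{1..t} \rangle.$$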
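The factorization of the product of factorials can be checked numerically. In the sketch below (an illustration with our own function names), the value of $h_t$ is computed as the number of earlier positions carrying the same pair of labels as position t + 1, which is what the sum over i, j reduces to since a single product is nonzero.

```python
from itertools import product
from math import factorial, prod


def chi(rho, gamma, d):
    """d x d co-occurrence matrix of the generalized permutation (rho; gamma), 1-based labels."""
    X = [[0] * d for _ in range(d)]
    for i, j in zip(rho, gamma):
        X[i - 1][j - 1] += 1
    return X


def product_of_factorials(rho, gamma, d):
    return prod(factorial(x) for row in chi(rho, gamma, d) for x in row)


def product_over_t(rho, gamma):
    """prod_{t=1}^{N-1} (1 + h_t(rho, gamma)), with 0-based lists."""
    out = 1
    for t in range(1, len(rho)):
        # h_t: earlier positions s with (rho_s, gamma_s) equal to the pair at position t + 1.
        h_t = sum(1 for s in range(t) if (rho[s], gamma[s]) == (rho[t], gamma[t]))
        out *= 1 + h_t
    return out


rho = [1, 2, 2, 2, 1, 3, 1, 3]
gamma = [1, 1, 2, 1, 3, 3, 3, 3]
print(product_of_factorials(rho, gamma, 3), product_over_t(rho, gamma))
```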
Lemma 4. Let $\alpha = (\alpha_1, \cdots, \alpha_N)$ and $\beta = (\beta_1, \cdots, \beta_N)$ be two lists of N elements
in a set $\mathcal{X}$. Let k be a symmetric kernel in $\mathcal{X}^N$, that is a kernel invariant under
a permutation of the order of both $\alpha$ and $\beta$: $\forall \pi \in S_N$, $k(\alpha, \beta) = k(\alpha_\pi, \beta_\pi)$. Then
$(\alpha, \beta) \mapsto \sum_{\pi \in S_N} k(\alpha, \beta_\pi)$ is positive definite.

Proof. The function g defined below is, by Haussler's (1999) convolution kernels
framework, a positive definite kernel of $\alpha$ and $\beta$:

$$g(\alpha, \beta) = \sum_{\pi \in S_N} \sum_{\pi' \in S_N} k(\alpha_\pi, \beta_{\pi'}).$$

Using the symmetric property of k, we have that

$$g(\alpha, \beta) = \sum_{\pi \in S_N} \sum_{\pi' \in S_N} k(\alpha, \beta_{\pi' \circ \pi^{-1}}) = N! \sum_{\pi \in S_N} k(\alpha, \beta_\pi),$$

which proves the result. ■

Lemma 5. $\sum_{\pi \in S_N} \kappa(\rho, \gamma_\pi) = r_1! \cdots r_d! \cdot c_1! \cdots c_d!\ T(r, c; K)$

Proof. For any couple of vectors $\rho, \gamma$ we have that both $k_1$ and $k_2$ only depend on
$X = \chi(\rho; \gamma)$. This is implicitly the case in the definition of $k_2$ and one can check
that

$$k_1(\rho, \gamma) = \prod_{t=1}^{N} k(\rho_t, \gamma_t) = \prod_{ij} k_{ij}^{x_{ij}}, \quad \text{where } X = \chi(\rho; \gamma).$$

With every permutation $\pi$ of $S_N$ we associate a transportation table $\chi(\rho; \gamma_\pi)$ which we
call the pattern of $\pi$. Following (Barvinok, 2008, §2, p. 7), we know that the number
of permutations $\pi$ that share the same pattern X for $X \in \mathbb{U}(r, c)$ only depends on
X, r and c through a formula known as the Fisher-Yates statistic $\nu(X)$ of X,

$$(5) \qquad \nu(X) = \operatorname{card}\{\pi \in S_N \mid \chi(\rho; \gamma_\pi) = X\} = \frac{r_1! \cdots r_d! \cdot c_1! \cdots c_d!}{\prod_{ij} x_{ij}!}.$$
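The counting formula of Equation (5) can be checked by brute force for a small total mass. The sketch below is an illustration with our own function names; it enumerates all N! permutations, which is only feasible for very small N.

```python
from collections import Counter
from itertools import permutations
from math import factorial, prod


def chi(rho, gamma, d):
    """d x d co-occurrence matrix of the generalized permutation (rho; gamma), 1-based labels."""
    X = [[0] * d for _ in range(d)]
    for i, j in zip(rho, gamma):
        X[i - 1][j - 1] += 1
    return X


def pattern_counts(rho, gamma, d):
    """For each table X, the number of permutations pi with chi(rho; gamma_pi) = X."""
    counts = Counter()
    for pi in permutations(range(len(gamma))):
        gamma_pi = [gamma[p] for p in pi]
        counts[tuple(map(tuple, chi(rho, gamma_pi, d)))] += 1
    return counts


def fisher_yates(X):
    """Product of margin factorials divided by the product of entry factorials."""
    r = [sum(row) for row in X]
    c = [sum(col) for col in zip(*X)]
    num = prod(factorial(v) for v in r) * prod(factorial(v) for v in c)
    den = prod(factorial(x) for row in X for x in row)
    return num // den


rho = [1, 1, 2, 2, 2]        # r = [2, 3]
gamma = [1, 2, 2, 2, 2]      # c = [1, 4]
counts = pattern_counts(rho, gamma, 2)
for X, n in counts.items():
    print(X, n, fisher_yates(X))
```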



We thus have that

$$\sum_{\pi \in S_N} \kappa(\rho, \gamma_\pi) = \sum_{X \in \mathbb{U}(r, c)} \nu(X)\, k_1(\rho, \gamma_\pi)\, k_2(\rho, \gamma_\pi)$$



v n! ■ ■ ■ r d l ■ Cl ! ■ • ■ c d ! A a< . nfj fgj _ , . , 

X£U(r,c) llij-Hr ij (LI a, 